
convert.py: BPE fixes? #2938

Merged (2 commits) on Sep 3, 2023

Conversation

KerfuffleV2
Collaborator

Trying to get https://huggingface.co/BAAI/Aquila-7B working but it... doesn't feel very close.

  • The main scores were added as int but added_tokens adds them as float: this fails in gguf since arrays have to be homogeneous. Adding them with the wrong type might have caused issues too. Fixed that.
  • Some models like Aquila have added tokens but no separate added_tokens.json. I added some logic to fall back to looking for the added tokens in tokenizer.json if it's available (sketched after this list).
  • At least in tokenizer.json (and the Aquila model I mentioned) the added tokens aren't necessarily distinct from the main vocab, so a token has to be filtered out if it's already in the vocab. Maybe added_tokens.json doesn't need this logic?
  • All tokens were added as USER_DEFINED. However, the token-id-to-text rendering in llama.cpp has no case for USER_DEFINED, so it would return an empty string 100% of the time. Fixed to add tokens with NORMAL type.
  • Scores were getting added as the negative token id. Scores aren't even used for BPE, right? I changed it to just use 0.0 like the Falcon converter. I don't think it makes a difference what they're set to.
  • There's some logic to try to handle models that seem to use <0xblah>-style tokens for bytes, same as LLaMA - for example https://huggingface.co/kfkas/Llama-2-ko-7b-Chat
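
Roughly, the added-token handling above looks like this (a minimal sketch, not the exact convert.py code; the function name and arguments are stand-ins for whatever the converter already has on hand):

```python
import json
from pathlib import Path

def load_added_tokens(model_dir: Path, vocab: dict) -> dict:
    """Prefer added_tokens.json, fall back to tokenizer.json, and drop
    anything that's already in the main vocab."""
    added = {}
    added_path = model_dir / "added_tokens.json"
    tokenizer_path = model_dir / "tokenizer.json"
    if added_path.exists():
        added = json.loads(added_path.read_text(encoding="utf-8"))
    elif tokenizer_path.exists():
        # HF tokenizer.json keeps added tokens in a top-level "added_tokens" list
        tokenizer = json.loads(tokenizer_path.read_text(encoding="utf-8"))
        added = {item["content"]: item["id"] for item in tokenizer.get("added_tokens", [])}
    # Added tokens aren't guaranteed to be distinct from the main vocab
    # (Aquila duplicates several), so filter those out.
    return {text: tok_id for text, tok_id in added.items() if text not in vocab}
```

Each surviving token then gets written with a float score (0.0) and NORMAL token type, so the gguf arrays stay homogeneous and llama.cpp can actually render the text.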

You won't get anywhere without #2889

That's apparently not enough for the Aquila model to do anything except die. Apparently it has no newline token? We just expect it to exist and blindly look it up:

llama.cpp/llama.cpp, lines 1752 to 1757 at e8422de:

// determine the newline token: LLaMA "<0x0A>" == 10 == '\n', Falcon 193 == '\n'
if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {
    vocab.linefeed_id = llama_byte_to_token(vocab, '\n');
} else {
    vocab.linefeed_id = llama_tokenize_internal(vocab, "\n", false)[0];
}
- so it'll crash there unless you comment out that line.

Another issue with trying to get the Aquila model to do anything is that the JSON files are not UTF-8; they're apparently GBK. On Linux at least you can convert them with something like iconv -f gbk -t utf8 < vocab.json > vocab.json.tmp and then rename over the original file. tokenizer.json also needs this treatment.
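
If iconv isn't available, the same re-encoding can be done with a few lines of Python (assuming the files really are GBK):

```python
# Re-encode the GBK JSON files to UTF-8 in place, keeping a .bak copy of the original bytes.
from pathlib import Path

for name in ("vocab.json", "tokenizer.json"):
    path = Path(name)
    raw = path.read_bytes()
    Path(name + ".bak").write_bytes(raw)
    path.write_text(raw.decode("gbk"), encoding="utf-8")
```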

After all that, it sort of works, for English text at least. Except it doesn't have ' ' (space) in its vocabulary, so tokenizing and all output is goingtobeallruntogethergoodtimes. It can't seem to tokenize Chinese characters at all. I also tried skipping the GBK-to-UTF-8 conversion and pasting in the wonky GBK stuff from the original files, but that didn't seem to work either (though I don't know that it would paste correctly, since my system is set to use UTF-8).

As for the other Korean model, one problem is it has no vocab.json, which convert.py's BPE mode requires, but that's pretty easy to extract from tokenizer.json (sketched after the logs below). It actually tokenizes properly (I think?) but only produces garbage output whether you speak Korean or English to it:

Trying Korean
main: prompt: '옛날 옛적에 작은 여우가 있었어요'
main: number of tokens in prompt = 14
 45555 -> '옛'
 44939 -> '날'
    35 -> ' '
 45555 -> '옛'
 44824 -> '적'
 31054 -> '에'
    35 -> ' '
 43089 -> '작은'
    35 -> ' '
 31457 -> '여'
 37719 -> '우가'
    35 -> ' '
 44815 -> '있'
 38248 -> '었어요'
main: static prompt based on n_keep: '옛날 옛적에 작은 여우가 있었어요'

sampling: repeat_last_n = 0, repeat_penalty = 1.040000, presence_penalty = 0.000000, frequency_penalty = 0.060000, top_k = 70, tfs_z = 0.950000, top_p = 2.000000, typical_p = 0.250000, temp = 1.200000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 200, n_batch = 4, n_predict = 500, n_keep = 14


옛날 옛적에 작은 여우가 있었어요는,는의,를가의에ing에가:가>가의에에를가>의가>를의를가는에,의가=가,의,가,가=가,o>t의는,>>r는t>r>o=r,o.w.t.o.o.r.
Trying English
main: prompt: ' Once upon a time, in a dark forest, there lived a little fox'
main: number of tokens in prompt = 32
    35 -> ' '
 26222 -> 'Once'
    35 -> ' '
   786 -> 'up'
   265 -> 'on'
    35 -> ' '
 29874 -> 'a'
    35 -> ' '
  2230 -> 'time'
 29892 -> ','
    35 -> ' '
   262 -> 'in'
    35 -> ' '
 29874 -> 'a'
    35 -> ' '
 26031 -> 'dark'
    35 -> ' '
  1454 -> 'for'
   342 -> 'est'
 29892 -> ','
    35 -> ' '
 12711 -> 'there'
    35 -> ' '
 29880 -> 'l'
  2347 -> 'ived'
    35 -> ' '
 29874 -> 'a'
    35 -> ' '
 29880 -> 'l'
  1992 -> 'ittle'
    35 -> ' '
  8944 -> 'fox'
main: static prompt based on n_keep: ' Once upon a time, in a dark forest, there lived a little fox'

sampling: repeat_last_n = 0, repeat_penalty = 1.040000, presence_penalty = 0.000000, frequency_penalty = 0.060000, top_k = 70, tfs_z = 0.950000, top_p = 2.000000, typical_p = 0.250000, temp = 1.200000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 200, n_batch = 4, n_predict = 500, n_keep = 32


 Once upon a time, in a dark forest, there lived a little fox은2의은의,는이,은에(,는(32의2(이은3에(022에은(의2은,에32,의은2(은이은2(이은의,32(에3의0(에3은(|,은,에(_에_,322,,은,23(023(은은,,은0에3은
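
(For reference, the vocab.json extraction mentioned above is only a couple of lines, assuming the usual HF tokenizers layout where the BPE token-to-id mapping lives under model.vocab:)

```python
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tokenizer = json.load(f)

# Write just the BPE token -> id table so convert.py's BPE mode can find it.
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(tokenizer["model"]["vocab"], f, ensure_ascii=False)
```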

This could be merged but hopefully we can talk about the existing problems and maybe figure out a way to fix more stuff first.

@KerfuffleV2
Collaborator Author

@klosax since you're the one that suggested testing with that Aquila model: #2842 (comment)

@ggerganov (Owner) left a comment

These changes should be OK to merge.

We cannot expect the current built-in BPE tokenizer to work with unicode chars. I removed this support when porting the reference implementation from https://github.com/cmp-nct/ggllm.cpp

It's just too much code for my taste.
If we cannot figure out an elegant way to support the BPE tokenizer (i.e. without 1000s of lines of unicode tables and handling), we probably have to drop support for it and require the user code to implement the tokenization. We can potentially expose the vocab through the API if necessary.

@KerfuffleV2
Collaborator Author

Thanks for the review.

Let me ask this: is there a model that convert can handle in BPE mode which is expected to work for running in main, etc.? Just from the BPE models other people have mentioned and that I've seen, it seems like the LLaMA models adapted to use BPE use it specifically because it deals with unicode characters better than the SPM one, which has pretty inefficient tokenization for stuff like Chinese characters.

In other words, is there currently a practical use case for --vocabtype bpe in convert.py? Because if not, that could be confusing. Right now there's no indication to the user that setting the vocab type to bpe is less supported. Perhaps we can add some warnings to the script and change the type to experimental-bpe. Something like that, so users don't just download a BPE LLaMA model and expect it to work.
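
Something along these lines, purely as a sketch of the idea (it assumes convert.py's existing --vocabtype argparse flag; the exact names are illustrative):

```python
import argparse
import sys

parser = argparse.ArgumentParser()
# Hypothetical rename: make it obvious the BPE path isn't fully supported yet.
parser.add_argument("--vocabtype", choices=["spm", "experimental-bpe"], default="spm",
                    help="vocab format (default: spm)")
args = parser.parse_args()

if args.vocabtype == "experimental-bpe":
    print("warning: BPE vocab support is experimental; tokenization in llama.cpp "
          "may not work correctly for this model", file=sys.stderr)
```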

I don't know too much about which LLaMAs use BPE, which should work, etc. With a little more information I'd be happy to add something like what I mentioned to convert.py. What do you think?

@ggerganov
Owner

In other words, is there currently a practical use case for --vocabtype bpe in convert.py?

I don't think so. We support LLaMA models that use the original SPM tokenizer. All other cases are experimental.
BPE is partially implemented and I have tested it only with Falcon-7B and Falcon-40B.

so users don't just download a BPE LLaMA model and expect it to work.

Inference works, it's just the tokenization that does not work.

@KerfuffleV2
Collaborator Author

KerfuffleV2 commented Sep 2, 2023

BPE is partially implemented and I have tested it only with Falcon-7B and Falcon-40B.

Falcon models aren't converted using convert.py though. For convert.py specifically, do you think we should do anything different for the BPE mode? Like I mentioned, changing the type to experimental-bpe, adding a warning, etc?

Maybe there should be something like a none vocab type that just doesn't include vocab with the model? (I think that's more complex so would probably be a different pull.)

edit: If you don't want to make further changes like that, please feel free to merge this. I think it's better than the existing behavior at least.

@ggerganov
Owner

Ok, let's merge for now. I'm not very sure what to do about Falcon - will figure it out as we go

@ggerganov ggerganov merged commit cff7b0b into ggerganov:master Sep 3, 2023
4 checks passed
@goerch
Collaborator

goerch commented Sep 18, 2023

@KerfuffleV2 : please check if #3252 improves the situation. And yes, I'm still converting Falcon-7B with convert-falcon-hf-to-gguf.py. Is there a better way?

@KerfuffleV2 KerfuffleV2 deleted the fix-convert-bpe branch November 17, 2023 03:13